CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data
https://arxiv.org/abs/1911.00359
https://aclanthology.org/2020.lrec-1.494/
In this paper, we describe an automatic pipeline to extract massive high-quality monolingual datasets from Common Crawl for a variety of languages.
Our pipeline follows the data processing introduced in fastText (Mikolov et al., 2017; Grave et al., 2018), which deduplicates documents and identifies their language.
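
As a rough illustration of the two steps named above (deduplication, then language identification), the following Python sketch may help. It is not the CCNet or fastText implementation. The choice of SHA-1 over whole normalized documents, all function names, and the stopword-based `toy_identify_language` are assumptions made here for illustration only. A real pipeline would plug a trained language classifier in through the `identify` parameter.

```python
import hashlib
import re
from typing import Callable, Iterable, Iterator, List, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash alike."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def deduplicate(docs: Iterable[str]) -> Iterator[str]:
    """Yield each document whose normalized content has not been seen before."""
    seen = set()
    for doc in docs:
        # Assumption: exact-match deduplication on a hash of the normalized text.
        digest = hashlib.sha1(normalize(doc).encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield doc


def toy_identify_language(text: str) -> str:
    """Hypothetical stand-in for a trained classifier: counts a few stopwords."""
    stopwords = {"en": {"the", "and", "is"}, "fr": {"le", "et", "est"}}
    tokens = normalize(text).split()
    scores = {lang: sum(t in words for t in tokens) for lang, words in stopwords.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def process(
    docs: Iterable[str],
    identify: Callable[[str], str] = toy_identify_language,
) -> List[Tuple[str, str]]:
    """Deduplicate first, then tag each surviving document with a language code."""
    return [(identify(doc), doc) for doc in deduplicate(docs)]


if __name__ == "__main__":
    sample = ["The cat is here", "the  cat is here ", "Le chat est ici"]
    for lang, doc in process(sample):
        print(lang, doc)
```

Deduplicating before language identification means the classifier only runs on documents that are kept, which is the cheaper ordering when duplicates are common.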